Alison’s EDA

Alison’s EDA before collaboration

Alison Yao
11/22/2021

Load Dataset

First, read the three CSV files downloaded from the 2019 OECD study of violence against women.

# dplyr provides full_join() and %>%; ggplot2 provides ggplot()
library(dplyr)
library(ggplot2)
attitude_df <- read.csv('./Data/attitude.csv')
law_df <- read.csv('./Data/laws.csv')
prevalence_df <- read.csv('./Data/prevalence.csv')

Initial Findings

By just eyeballing the data, I can see that:

1. Not all countries are listed, and not all countries have a record for all three features.
2. The three features are on different scales (0-100% or 0-1). Strangely, Law takes only four values: 0.25, 0.5, 0.75 and 1, so it looks more categorical than continuous.
3. The countries are abbreviated, so I need some way to match the codes with the names.
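These observations can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming each file has the `LOCATION` and `Value` columns used later in this notebook. The `toy_law` data frame is made up for illustration and only stands in for `law_df`.

```r
# Toy stand-in for law_df (illustrative values, not the OECD data)
toy_law <- data.frame(
  LOCATION = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  Value    = c(0.25, 0.75, 0.75, 1, 0.5),
  stringsAsFactors = FALSE
)

# Range of the feature, ignoring missing values
law_range <- range(toy_law$Value, na.rm = TRUE)

# Distinct values: a very short list suggests a categorical variable
law_levels <- sort(unique(toy_law$Value))

print(law_range)
print(law_levels)
```

Running the same two calls on `attitude_df$Value` and `prevalence_df$Value` would show whether those features are on the 0-100 scale.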

# Data Source: https://www.iban.com/country-codes
country_code_df = read.csv('./Data/country_code.csv')
check_code <- function(df){
  for (code in df$LOCATION){
    # Report any code that is missing from the lookup table
    if (!(code %in% country_code_df$Abb3)) {
      print(paste(code, 'not found!'))
    }
  }
}
check_code(attitude_df)
check_code(law_df)
check_code(prevalence_df)
# There should not be any output - Awesome!

So now I have a way to match the country codes to their English names.
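One way to do the actual matching is a left join against the lookup table. This is only a sketch under assumptions: `Abb3` is the code column used above, but `Name` is a hypothetical name for the column holding the English country name, and both toy data frames are illustrative stand-ins for the real files.

```r
# Toy stand-ins (illustrative); `Name` is an assumed column name
toy_codes <- data.frame(
  Abb3 = c("AUS", "CAN", "FIN"),
  Name = c("Australia", "Canada", "Finland"),
  stringsAsFactors = FALSE
)
toy_attitude <- data.frame(
  LOCATION = c("AUS", "FIN", "XXX"),
  Value    = c(3.2, 11.2, 5.0),
  stringsAsFactors = FALSE
)

# Vectorized version of the loop check: codes with no match in the lookup table
missing_codes <- setdiff(toy_attitude$LOCATION, toy_codes$Abb3)

# Left join: keep every data row and add the name where a match exists
named <- merge(toy_attitude, toy_codes,
               by.x = "LOCATION", by.y = "Abb3", all.x = TRUE)

print(missing_codes)
print(named)
```

Unmatched codes end up with `NA` in the name column, so nothing is silently dropped.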

Join Data

I will put everything in one data frame by full outer joining the three subsets on the country code.

attitude_sub <- attitude_df[c('LOCATION', 'Value')]
colnames(attitude_sub) <- c('Country', 'Attitude')
law_sub <- law_df[c('LOCATION', 'Value')]
colnames(law_sub) <- c('Country', 'Law')
prevalence_sub <- prevalence_df[c('LOCATION', 'Value')]
colnames(prevalence_sub) <- c('Country', 'Prevalence')
df <- full_join(attitude_sub, law_sub, by = "Country")
df <- full_join(df, prevalence_sub, by = "Country")
head(df)
  Country Attitude  Law Prevalence
1     AUS      3.2 0.75       16.9
2     CAN      7.8 0.25        1.9
3     FIN     11.2 0.75       30.0
4     FRA      6.6 0.25       26.0
5     DEU     19.6 0.75       22.0
6     HUN      8.7 0.75       21.0

Looks good! But there must be many missing values. Let’s check.

summary(df)
   Country             Attitude          Law          Prevalence   
 Length:163         Min.   : 0.00   Min.   :0.250   Min.   : 1.90  
 Class :character   1st Qu.: 8.60   1st Qu.:0.500   1st Qu.:18.30  
 Mode  :character   Median :22.05   Median :0.750   Median :24.60  
                    Mean   :27.52   Mean   :0.592   Mean   :28.96  
                    3rd Qu.:42.52   3rd Qu.:0.750   3rd Qu.:35.00  
                    Max.   :92.10   Max.   :1.000   Max.   :85.00  
                    NA's   :11                      NA's   :34     

Indeed, there are 11 missing values in Attitude and 34 in Prevalence; Law has none.
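The missing values can also be counted directly instead of reading them off `summary()`. A minimal sketch on a made-up data frame, `toy_df`, that mimics the column layout of `df` (the values are illustrative):

```r
# Toy stand-in for the joined data frame (illustrative values)
toy_df <- data.frame(
  Country    = c("AAA", "BBB", "CCC", "DDD"),
  Attitude   = c(3.2, NA, 11.2, 6.6),
  Law        = c(0.75, 0.25, 0.75, 0.25),
  Prevalence = c(16.9, 1.9, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_df))

# Number of rows where all three features are present
n_complete <- sum(complete.cases(toy_df))

print(na_counts)
print(n_complete)
```

The complete-case count matters for the multi-variable plots, which can only show countries that have both plotted features.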

Plots

The data source website already offers some great interactive visualizations, but we add a few more here.

Single Variable

df %>% 
  ggplot(aes(x = Attitude)) +
  geom_histogram(binwidth=2) + 
  labs(
    title = 'Histogram of Attitudes toward violence',
    subtitle = '152 countries included',
    x = 'Attitudes',
    y = 'Frequency'
  )

df %>% 
  ggplot(aes(x = Prevalence)) +
  geom_histogram(binwidth=2) + 
  labs(
    title = 'Histogram of Prevalence of violence in the lifetime',
    subtitle = '129 countries included',
    x = 'Prevalence',
    y = 'Frequency'
  )

df %>% 
  ggplot(aes(x = Law)) +
  geom_bar() + 
  scale_x_continuous(breaks = c(0.25, 0.5, 0.75, 1.00)) +
  labs(
    title = 'Bar chart of Laws on domestic violence',
    subtitle = '163 countries included',
    x = 'Law',
    y = 'Frequency'
  )

This one is not so informative because there are only 4 values.
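Since Law only takes four values, a frequency table may say as much as the plot. A minimal sketch, treating Law as a factor; the `toy_law` vector is made up for illustration and stands in for `df$Law`.

```r
# Toy stand-in for df$Law (illustrative values)
toy_law <- c(0.75, 0.25, 0.75, 0.5, 1, 0.75)

# Treat the four scores as ordered categories rather than numbers
law_factor <- factor(toy_law, levels = c(0.25, 0.5, 0.75, 1))

# Count how many countries fall into each category
law_counts <- table(law_factor)

print(law_counts)
```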

Multi-variable

df %>% 
  ggplot(aes(x = Attitude, y = Prevalence)) +
  geom_point() + 
  labs(
    title = 'Scatterplot of Prevalence vs Attitude',
    x = 'Attitude',
    y = 'Prevalence'
  )

We can see a positive correlation between Attitude and Prevalence.
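The visual impression could be backed by a correlation coefficient. This is a sketch on made-up numbers, not a result from the OECD data: `toy_df` is illustrative, and `use = "complete.obs"` drops the rows where either feature is missing.

```r
# Toy stand-in for df (illustrative values, including missing entries)
toy_df <- data.frame(
  Attitude   = c(3.2, 7.8, 11.2, NA, 19.6, 42.0),
  Prevalence = c(16.9, 1.9, 30.0, 26.0, NA, 35.0)
)

# Pearson correlation on rows where both values are present
r_pearson <- cor(toy_df$Attitude, toy_df$Prevalence,
                 use = "complete.obs")

# Spearman rank correlation, less sensitive to outliers and skew
r_spearman <- cor(toy_df$Attitude, toy_df$Prevalence,
                  use = "complete.obs", method = "spearman")

print(r_pearson)
print(r_spearman)
```

Replacing `toy_df` with `df` would give the coefficients for the real data.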

df %>% 
  ggplot(aes(x = Law, y = Prevalence)) +
  geom_point() + 
  labs(
    title = 'Scatterplot of Prevalence vs Law',
    x = 'Law',
    y = 'Prevalence'
  )

df %>% 
  ggplot(aes(x = Law, y = Attitude)) +
  geom_point() + 
  labs(
    title = 'Scatterplot of Attitude vs Law',
    x = 'Law',
    y = 'Attitude'
  )

library('plotly')
plot_ly(x=df$Attitude,
        y=df$Prevalence,
        z=df$Law,
        type="scatter3d",
        mode="markers")  # set the mode explicitly so plotly does not have to guess it

This looks fancy but is pretty useless. I cannot see much from this 3D plot.
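Because Law has only four levels, a per-level summary may be easier to read than a third axis. A rough sketch of that idea, assuming the column names of `df`; `toy_df` is made up for illustration, and the choice of the median is mine rather than part of the original analysis.

```r
# Toy stand-in for df (illustrative values)
toy_df <- data.frame(
  Law        = c(0.25, 0.25, 0.75, 0.75, 1),
  Attitude   = c(7.8, 6.6, 3.2, 11.2, 20.0),
  Prevalence = c(1.9, 26.0, 16.9, NA, 30.0)
)

# Median Attitude and Prevalence for each Law level.
# na.action = na.pass keeps rows with missing values,
# and na.rm = TRUE is passed on to median() to skip them per column.
by_law <- aggregate(cbind(Attitude, Prevalence) ~ Law,
                    data = toy_df, FUN = median,
                    na.action = na.pass, na.rm = TRUE)

print(by_law)
```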